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1. INTRODUCTION 

An important tool in linear algebra is the matrix decomposition, which expresses a 
(rectangular) matrix as a product of two or more simpler matrices. Such decompo- 
sitions are used for easy computation of rank, null space, and solving linear system 
and related problem. 

There are many well-known algorithms for decomposition defined in any field, 
finite or not. A common approach consists in reducing a matrix usually to the 
row-echelon form by row operations [Meyer 2000] . Once the row-echelon form is 
obtained the rank will be equal to the number of non-zero rows and null space can 
be easily computed. Gauss LU decomposition [Golub and Van Loan 1996] can be 
also used to to solve linear systems (when the matrix is square and full rank) or 
compute the rank. Applied to rectangular or rank deficient matrices, it is costly as 
the computation of the row-echelon form. 

In fact, Gauss decomposition of a matrix A produces an LU factorization, i.e. 
PAQ = LU where P and Q are permutation matrices, and U is the row-echelon 
form of A (up to column permutation) . The asymptotic cost of a naive implemen- 
tation of LU decomposition for a dense n x n matrix is 0(n 3 ). However such a cost 
can be reduced using a combination of recursion and matrix-matrix multiplication. 

For example, using matrix-matrix multiplication (MMM) in the construction of 
the factorization, the asymptotic cost may be reduced by using faster MMM algo- 
rithms. The problem of fast matrix-matrix multiplication is still an open problem. 

The naive MMM algorithm is based on the classical definition of the multiplica- 
tion of two matrices; its cost is n 3 multiplications and n 2 (n — 1) additions and so 
we classify the naive algorithm as an 0(n 3 ) algorithm. 
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Strassen matrix-matrix multiplication algorithm [Strassen 1969] - which asymp- 
totic cost is C(n log2 7 ) in any field - uses only seven scalar multiplications (instead 
of the usual eight) to multiply 2x2 matrices. In fact, as proved in [Winograd 
1971], Strassen's algorithm is optimal for 2x2 matrices. Further asymptotic im- 
provements [Coppersmith and Winograd 1990] can be obtained to perform multi- 
plication of larger matrices. Hybrid algorithms incorporate Strassen and Winograd 
variants recursively to achieve high performance on large matrices [Huss-Lederman 
et al. 1996; Higham 1990; Douglas et al. 1994; Kaporin 1999]. The asymptotic 
cost of 0(n log2 7 ) means that for a large enough n, Strassen's algorithm should 
theoretically perform multiplication significantly faster than the naive algorithm. 
However, asymptotic cost means that the actual cost of standard LU decomposi- 
tion is about C\r? while the actual cost of Strassen multiplication is about C 2 n log2 7 
where C 2 3> C\. Therefore, the use of Strassen algorithm is convenient only for 
large n. Strassen algorithm is recursive so that normally the recursion is terminated 
when the cost of recursion is larger than the classical matrix-matrix multiplication. 
That happens when C 2 n l °^ 7 w C\n 3 , i.e. n w exp(ln(C 2 /Ci)/(3 - log 2 7)). For 
instance, if C2/C1 « 10 we have n w 150000 while in case C2/C1 ~ 5 we have 
n w 4000. In practice, the computation of the switching point must consider ad- 
ditional costs and it is implementation dependent. For a detailed analysis see for 
example [Huss-Lederman et al. 1996; Higham 1990]. 

The efficient computation of MMM can be further improved in case of finite field 
and in particular for finite field F 2 the Method of four Russian for Multiplication 
(M4RM) is a fast MMM algorithm, which cost is 0(n 3 / log n) [Arlazarov et al. 1970; 
Aho et al. 1974; Albrecht et al. 2010]. Its asymptotic cost is better than classical 
matrix-matrix multiplication; but it is worse than recursive Strassen's algorithm. 
However, if the actual cost of M4RM is about C 3 n 3 / logn, we have that C 3 -c C\ so 
that is competitive for not too big matrices. A combination of Strassen and M4RM 
is a good compromise for faster matrix-matrix multiplication [Albrecht et al. 2010]. 

The fast decomposition of an n x m matrix with entries in a finite field F 2 is 
an important issue in algorithmic number theory and cryptanalysis [Shoup 2009; 
Bach and Shallit 1996]. In fact, some problems in cryptanalysis and number theory 
can be transformed in one involving a linear system with entries in F 2 . The exis- 
tence of solutions of a linear system can be deduced by analyzing the rank of the 
corresponding matrix. 

In this paper we propose a new efficient algorithm to perform the matrix factor- 
ization for large dense rectangular matrices with entries in F 2 . It uses an efficient 
implementation of M4RM algorithm and uses only row permutations. In section 2 
our notation is given and an appropriate data structure to store and manipulate the 
matrix is described. In section 3 the non-recursive block algorithm is presented. 
In section 4 the recursive version of the main algorithm is described. In section 5 
some details about the choice of some parameters are given. In section 6 some tests 
comparing our algorithm with Sage packages [Stein et al. 2012] are presented. 
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2. THE USED MATRIX DATA STRUCTURE 

Let A be an n x m matrix having entries in F 2 , i.e. A G Mat(n, m, F 2 ), denoted as 

/ ao,o ••• a 0l m-i \ 
A= : -. : 

We adopt the following convention for intervals of indices a..b = {a, a+l, . . . ,b— 1} 
and so, for instance, the sub-matrix A a b c d e Mat(o — a, d— c, F 2 ) of A represents 
the intersection of rows of index from a to b - 1 and columns of index from c to 
d—1. The sub-matrix A a . b .. e Mat(6 — a, m, F 2 ) of A is composed by the rows of 
index from a to b — 1. Furthermore, we denote by rank(A) the rank of A, i.e. the 
maximum number of linearly independent row (or column) vectors of A. 

We are interested in the computation of the factorization of large dense matrices 
that do not fit into the cache. Thus, a good arrangement of the elements of the 
matrix in memory is important for an efficient data retrieval. Moreover, bits are 
naturally grouped in words whose size is a power of 2, typically 32, 64 or 128 for 
larger architectures. An element of F 2 is naturally represented as one bit, so that 
elements in F 2 are naturally grouped in words of 32, 64 or 128 bits. In particular, a 
string of elements in F 2 is packed in an (unsigned) integer. The advantage of stor- 
ing multiple elements of F 2 as an integer is that it guarantees a natural parallelism 
of some operations. 

From now on, we denote with b the number of bits of the computer architecture, 
i.e. the number of bits in one machine word. Given two integers x and y whose 
bits represent elements in F 2 , the operation of exclusive or, denoted by x © y, is the 
sum © in F 2 applied to x and y bitwise; the and operation, denoted by x y, is the 
multiplications in F 2 applied to x and y bitwise. Infact, we have the following 
formulas 



b b b b 

x = ^x 4 2 4 , y = ^? /i 2\ xffiy = ^(x l ©y 4 )2\ x0y = ^(i t 0y,)2». 

4=0 4=0 4=0 4=0 

The entries of a matrix are packed into integers that represent groups of elements of 
the matrix itself. In particular, b consecutive entries in one row are packed into one 
integer. The way the integers are arranged changes how you access the elements 
of the matrix and you implement the elementary operations. 

A matrix A e Mat(n, m, F 2 ) is stored into a matrix of non-negative integers 
A G Mat(n, fi, N) with [ib - b < m < \xb. To access to the element a io of A, we have 
to determine the corresponding column-block q and then the right bit. Precisely, 
using integer division with remainder j = qb + r, (0 < r < b), the element ay- 
corresponds to the r-th bit of the integer Ai q . 

Remark 2.1. According to the previous notation, the least significant bit is on 
the left respect to the most significant bit, i.e. we are using big endian bit order. 
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Fig. 1. 



We consider a trivial example. Let 6 = 3 and A be the following (4 x 5)-matrix: 



A = 



Since m = 5 is not a multiple of b = 3, we directly memorize the matrix 
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in a 4 x 2 matrix of 3 bit integer as follows 
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The elements of A. can be organized using row-major order or column-major order. 
Our algorithm works better using column-major order. For example matrix A in (1) 
is stored as 

(5 1 3 4 | 3 2 1 3) . 

Due to the way the matrix is stored, it is clear that the operations acting on rows or 
block of b consecutive columns (stored as one integer) should be preferred. Opera- 
tions acting on isolated columns should be avoided. In fact, the cost of an operation 
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on a single column on matrix A is about equal (or more) the cost of the same oper- 
ation performed on a group of b columns when the columns are stored in a single 
column of the integer matrix A. 

3. BLOCK DECOMPOSITION 

In this section we give a (non-recursive) block decomposition that holds for matri- 
ces with entries in any field. For simplicity of notation, we are going to describe it 
for matrices in F 2 . Let consider A e Mat(n, m, F 2 ) such that it can be split in four 
blocks 



B 


C 


D 


E 



where the block B e Mat(r, b, F 2 ), with r < b, has full rank and the rows of D are 
linearly dependent on those of B. In other words, it holds that D = YB. 
The factorization is based on the identity: 
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(2) 



Due to the previous identity, we have that 
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and notice that rank(A) = rar\k(E © YC) + rank(B) = rank(£ © YC) + r. The 
sub-matrix E ffi YC e Mat(n - r,m - b, F 2 ) is the Schur complement of A (see 
[Haynsworth 1968; Zhang 2005]). Thus, we have reduced the decomposition to a 
smaller problem. Applying the same idea as before we can reduce the decomposi- 
tion to smaller and smaller problems with a reduction steps that can be described 
as follows. 

Let A^ = A and P(°) be a permutation matrix such that P^A^ can be parti- 
tioned as 



p(0) A (0) 



B (0) 


C (0) 


£>(0) 


E (0) 



, where B {0) e Mat(r , b, F 2 ), rank(B (0) ) = r < b 



and the rows of are linearly dependent on those of B^ a \ i.e. = Y^B^°\ 
Using the products in (2) we obtain the following decomposition 



p(o) A (o) = 



I 







I 





y(0) 


/ 










B (0) 


C (0) 





/ 



where A^ = E^ © Y^C^ G Mat(n - r ,m - 6, F 2 ). Now, we can apply the 
same idea to the Schur complement A^\ Let be a permutation matrix such 
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that 



P W A W 



B (l) 


C (l) 







, where B (1) e Mat(n, 6, F 2 ), rank(B (1) ) = n < & 



and I)W = Y^B^K Due to the decomposition (2), we reduce further the prob- 
lem of the decomposition of to the decomposition of = E^®Y^C^ <E 
Mat(n — ro—ri,m — 2b, F 2 ). 

Going forward, at the j th stage, the original matrix A is transformed into the 
sub-matrix A^; notice that rank(A) = rank(A^) + r + r\ + ■ ■ ■ + rj-i. Next, we 
find a permutation such that A^ can be partitioned as follows 



p(j) _ 
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with e Mat(rj,6,F 2 ), 



(3) 



with < 6 and B^ full rank and = yWflW. Using (2) again, we have 
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where 



A U+i) =E U) ©y0')c (j) . 



(4) 



Note that A^ +v > is the Schur complement of P<J) A<i) in F 2 and, after /x steps, we 
obtain 
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(5) 



where Z, is non-singular lower triangular matrix and (7 is the full rank upper block 
staircase. Let p(^ _1 ) be a permutation matrix such that pO* _1 ).4V _1 ) can be par- 
titioned as 



pO*-i)_4(M-i) = 



_B(m-i) 



where B^" 1 ) e Mat(r At _i,6',F 2 ), 



with b' = m - (fj, - 1)6 that satisfy r M _i < b' <b and B^ -1 ) is full rank. 
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Moreover, ^ = ^B^ ^ and the decomposition (5) becomes 
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where L is full rank lower trapezoidal matrix and U is the full rank upper block 
staircase. The previous steps can be resumed in the Lemma: 

Lemma 3.1. Given any (rectangular) matrix A, there exist a permutation P such 
that 



PA = LU 



where U is full rank upper (block) triangular matrix and L is a full rank lower trape- 
zoidal matrix. 

Clearly, rank(A) is the rank of the matrix U which is the number of its rows: 
rank(A) =^=o^- 

Observe that the computation at the step j involves matrix-matrix multiplications 
to obtain the Schur complement (4) which are the mostly costly operations of the 
presented algorithm. The multiplication of a p x 6-matrix by a b x q-matrix, in case 
p » b, can be efficiently performed using the M4RM algorithm [Arlazarov et al. 
1970; Aho et al. 1974; Albrecht et al. 2010]. Thus, in the computation of A^^ 
it is convenient to use the M4RM algorithm, because this block operation is more 
efficient than the usual row operations. 

This decomposition is based on the selection of and the construction of 
at each step. Efficient algorithms for this will be discussed in the next sections. For 
this purpose the incremental construction of an inverse of a matrix and pseudo- 
inverse construction is a necessary tool. 

Remark 3.2. The decomposition described in Lemma 3.1 when b = 1 is equiv- 
alent to the PLE factorization described in [Albrecht et al. 2011; Jeannerod et al. 
2011]. However computation with b — 1 is not convenient losing natural paral- 
lelism of integer operations. 

3.1 The computation of Y^ 

The permutation P applied to matrix A results in 



PA = 



B 


C 


D 


E 



where the rectangular matrix B e Mat(r, b, F 2 ), with r < b, has full rank and its 
satisfies D = YB for an opportune matrix Y which we have to determine. 
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In case r = b, since B e GL(b, F 2 ) we easily deduce that Y = DB 1 . Instead, 
when the matrix B is full rank with less rows than columns, a pseudo-inverse B^ 
has to be computed. For example the pseudo-inverse of Moore-Penrose [Ben-Israel 
and Greville 2003] is given by 

B^ = B T (BB T )- 1 

and satisfies 

BB f = BB T (BB T ) 1 = J, 

DB^ = YBB^ = Y. 

Thus, Y can be computed by simple right multiplication by a pseudo-inverse of B. 
Although the use of the Moore-Penrose's pseudo-inverse is correct, a more effi- 
cient pseudo-inverse can be constructed. Let J e Mat(6, r, F 2 ) be an insertion 
matrix whose effect is to insert b — r zero-rows into a matrix with r rows. Let 
R G Mat(6, b, F 2 ) be the matrix containing linearly independent rows which makes 
the square matrix JB + R non-singular. Notice that J T is a projection matrix which 
satisfies 

J T J = I, J T R = 0. (6) 

If r = b, i.e. if B is square and non-singular, insertions are not necessary and we 
have J = I, R = 0. 

Example 3.3. In case 6 = 3 and r = 2 we can see a situation as 





"1 0" 




"0 0" 


Jj = 





Rj - 


1 




1 








Let consider Z = (JB + R) 1 J. Due to the properties in (6) and the relation 
D = YB, we have 

BZ = B(JB + R) 1 J = J T JB(JB + R) 1 J 
= J T (JB + R)(JB + R) 1 J = J T J = I 
DZ = YBZ = Y. 

Thus the matrix Z has the same effect of the pseudo-inverse (it is a pseudo-inverse 
different from the Moore-Penrose one) and it is used in the computation of Y 
during the factorization procedure. In order to build Z, we need an algorithm to 
compute the inverse of a (small) square matrix in F 2 ; it will be treated in the next 
section. 

3.2 Incremental construction of the inverse of a square matrix in F 2 

Let B e GL(&, F 2 ) with all principal minors not-singular, it is possible to incremen- 
tally build its inverse Z. This requirement is not restrictive because every non- 
singular matrix by a row permutation satisfies it (due to Gauss LU decomposition, 
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see [Kincaid and Cheney 2002] page 156 Theorem 1). Let B k be the fc th principal 
minor of B and Z k its inverse. We can directly obtain from B^ 1 using the 

following factorization 



Bk+i 
which holds when 
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Remark 3.4. In case of W q , the previous condition becomes a — r T Z k c = (3 
where a, [3 e ¥ q and /3 ^ 0. Moreover, the last block in (7) becomes 
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Inverting factorization (7) we get immediately: 
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(Notice that Bi is a 1 x 1 matrix and thus i?i = Zi = [1]). Therefore, due to (9) 
we can compute Z k+1 from Z k as 



Z fe+ i - Bfc+i - 



Z fe ©(Z fc c)(r T Z fe ) Z fe c 



1 



(10) 



which needs two matrix-vector multiplication and a rank one update. Moreover, 
setting c = Z^c, the last matrix in (10) can be written as a matrix-matrix product: 
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• (11) 



Let M k be the matrix obtained multiplying by Z k the first k rows of B: 
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Due to (11), the update of M k+1 results in 
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and thus, Hk+i is used in updating both Mk+i and Zk+i- Moreover, we obtain the 
following update formulas: 



H 



fe+i 



c 




I © cr T c 




C 




C®cs T 


_e T _ 




r T 1_ 




e T _ 




s T 



where s T = r T C ffle T '. 



Note that it is possible to perform the incremental build when all principal minors 
are non-singular, which is equivalent to satisfy equation (8) for all fc. Such a con- 
dition is used to dynamically select linear independent rows in the construction of 

BCi). 

3.3 Construction of row permutation 

The steps of reduction from A^) to A^ +r> need the computation of permutation 
matrix P^O which permutes the rows of A^ in order to obtain A^ partitioned 
as in (3). The block must be non-singular with all principal minors non- 
singular. This permutation can be computed using only the first b columns of A^> 
and must satisfy 



1 A »,o..b - 



The selection of the rows and the construction of the inverse of B^O are done 
together using incremental update of Section 3.2, where the selected rows must 
satisfy condition (8). Observe that to check condition (8), the £-row of the sub- 



matrix A^\ b is partitioned as A\ J, Q b = (r T ,a, 1 ) where the 1 portion of the 
row is ignored in the computation. 

3.4 Insertion of linearly independent rows 

In the computation of row permutation it may happen that condition (8) is not sat- 
isfied by all the rows of the column block b . Obviously, such a situation does 
not arise if n > m and when the matrix is full rank. 

At this stage, standard algorithms introduce column permutations to satisfy condi- 
tion (8) . If such a condition cannot be satisfied even using column permutations, 
it means that last rows are linearly dependent on the previous ones and then the 
algorithm must end. Column permutations are not executed in our algorithm, so 
that new linearly independent rows are inserted using the two matrices J and R, 
introduced in Section 3.1. We observe that it is easy to build a row that satisfies con- 
dition (8) and that, in particular, the row (r T , a, (3 T ) = (0 T , 1, T ) trivially satisfies 
it. In practice, the presented process mixes row permutations and row insertions; 
it can be respectively split in a row permutations followed by a rows insertions: 



U) a CO - 



,0..b 



JiB U) + Ri 



Notice that the permutation and the insertion Jj are chosen in such a way the 
square block JjB^ + Rj has all principal minors non-singular and satisfies (6). 
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The following procedure performs the operation described in Sections 3.2-3.3 
and 3.4 to obtain an incremental construction of the pseudo-inverse Z. 



Procedure buildZ(.4, r) 


1 i b <- 0; 


2 for j 


= 0..6 do Zj <- 2i; Mj <- V; V 3 <- -1; 


3 for fcbit = 0..6 do 


4 


c 


<- 2 feb "; 


5 


for i = 0..fcbit do 


6 




if 7W 4 2 fcbit ^ then c <- c © 2 l ; 


7 


end 


8 


for i = i b ..r do 


9 




if popCount (c(D At) is odd then 


10 




| 'Pk bit ^i; Ai^A lb ; M kbH ^A lb ; i b <-i b + l; break; 


11 




end 


12 


end 


13 


y 


M-k bit ; II Update of Z and M 


14 


for j = 0..fc b ;t do 


15 




if y then Z febit <- Z febit e Z f , M kbit <- M febit © M,-; 


16 


end 


17 


for j = O-.fcbit do 


18 




if c then ^ <- Z febit © Zf, Mj <- M kbit © Mf, 


19 


end 


20 end 




21 m <- 0; 


22 for i 


= 0..&do 


23 


if Vi = -1 then 






// m is the integer with bits complemented 


24 




for j = 0..6 do Zj <- (Zj/2 m)© m); 


25 


else 


26 




m «— 2m + 1; 


27 


end 


28 end 





4. BLOCK RECURSIVE ALGORITHM 

In this section we describe a recursive version of the algorithm presented in Sec- 
tion 3. Let consider A e Mat(n,m,F 2 ) splitted into two sub-matrices A L e 

Mat(n,p, F 2 ) and A R e Mat(n, m-p, F 2 ) with p = [m/2] such that 

A=[A L \A R \. 

Applying the decomposition in Lemma 4.1 to the left part A L , we have PA L = LU 
and so 

PA = [PA L \PA R ] = [LU\PA R ] . 
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Since PA R 
we obtain 



D 



, where C e Mat(Y, m-p, F 2 ) and D e Mat(n -r,m-p, F 2 ), 
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Let A' = D © L r . .„,.i * r .C, we can recursively apply the factorization P'A' 
L'E/' and we obtain 



PA = 
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. 





Remark 4.1. Notice that here we perform the matrix-matrix multiplications us- 
ing our implementation of the Strassen's algorithm. 

5. PERFORMANCE TUNING 

The algorithm presented in Section 3 completely avoids column permutations and 
the cost of the decomposition is given by the cost of the matrix-matrix multiplica- 
tion. The cost to perform the M4RM algorithm to multiply A e Mat(ra, 6, F 2 ) and 
B G Mat(fr, b, F 2 ) is approximately given by the costs of the Xor operations. In 
particular, if we neglect the cost of the memory access and other minor costs, we 
obtain that 



cost of M4RM = + nR b c , 



(2 C - l)Xor 6 , = 



Xor fe (12) 



where 

(1) c is the size of the tables used to perform M4RM algorithm; 

(2) T b c is the cost of the tables construction; 

(3) R b c is the cost of the rows operations; 

(4) Xor;, is the cost of the Xor operation using integer of b bit size. 

In the following subsection we are going to discuss about the choices of the 
parameters b and c. Moreover, we will give an idea about the switching point from 
the M4RM multiplication and the Strassen multiplication. 
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5.1 Integer word-size and performance of M4RM 

Let us consider the two matrices A g Mat(n, 2b, F 2 ) and B g Mat(26, 2b, F 2 ). Due 
to (12) and using 2b bits as word-size, the cost to perform the M4RM to multiply A 
and B is approximately T 2b + nR 2b . On the other hand, using a word-size of b bits, 
the cost of M4RM becomes AT b + nAR b . 

Considering the cost of the rows contribution and the respective ratio, we have that 

Cost of 6-bits row _ AnR b _ 2Xor fc 
Cost of 2fc-bits row ~~ nR 2b ~ Xor 2h 

and so, in case of Xor 2 fc < 2Xorb, it is convenient to use 2b bits. 
The cost of the construction of the tables is given by AT b and T 2b and he corre- 
sponding ratio is then 

Cost per 6-bits tables _ AT b _ 2Xor fc 
Cost per 26-bits tables ~~ T 2b ~ Xor 26 ' 

Since the cost of the Xor^ « Xor 2 {,, it is convenient to chose 2b as word-size. 

Once the size b has been chosen, should be more convenient to choose the size c 
of the tables that are used for the M4RM algorithm. Since the cost to perform the 
product of the two matrices A g Mat(ra, b, F 2 ) and B g Mat(6, b, F 2 ) is given by 
(12), the optimal table size c (when b and n are known) can be estimated minimiz- 
ing (12), which is the minimum of the following function 

C(b,c,n)= Q (2 c -l + n). 

Remark 5.1. Actually, to determine the minimum of C is complicated. However, 
it can be observed that C, as a function of n, is a straight line with a slope that 
decreases when c increases. So, when n exceeds the value C(b, c, n) = C(b, c+1, n), 
it is convenient to use a table of size c+1 instead of size c. In the case of the 
product of square matrices, the minimum cost is obtained minimizing C(b,c,b) 
(with respect to c) and we obtain the following values: 

argmin{C(32,c,32)} w 4.08 argmin{C(64, c, 64)} w 4.77 

C C 

argmin{C(128,c, 128)} « 5.5. 

c 

In Table II we report the costs to perform product of square matrices of size 32, 
64 and 128. 

5.2 How to choose switching point for Strassen Matrix-Matrix multiplication 

The cost to multiply A g Mat(n, b, F 2 ) and B g Mat(6, b, F 2 ) using the M4RM 
algorithm is expressed in (12). Starting by the previous formula, we can easily 
obtain the cost to multiply (with M4RM algorithm) two matrices of size n. 

Let N = n/bbe the number of blocks in which we divide the matrix. We must 
apply TV 2 times the M4RM algorithm and we obtain the following cost 



M(n) - N 2 (T b c + nR h c ) = ^ (n 2 (2 c - 1) + n 3 ) 
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Table I. Cost of tables construction and cost of the rows operation for M4RM. 

CAB C A B 







32 bits 




64 bits 


128 bits 




T c 


1000 x R c 


T c 


1000 x R c 


T c 


1000 x Rc 


2 


0.028 


10.22 


0.048 


15.80 


0.086 


49.56 


3 


0.044 


6.25 


0.066 


10.77 


0.151 


34.19 


4 


0.069 


5.18 


0.099 


9.71 


0.223 


21.89 


5 


0.140 


3.78 


0.151 


7.30 


0.349 


19.89 


6 


0.222 


2.91 


0.254 


6.07 


0.688 


17.25 


7 


0.252 


2.82 


0.465 


5.06 


2.322 


15.72 


8 


0.484 


2.44 


0.721 


4.73 


3.734 


14.52 


9 


0.772 


2.40 


1.709 


4.37 


7.446 


16.12 


10 


4.227 


1.83 


3.465 


4.98 


11.364 


16.87 



Normalized cost 





16 x T c 


16000 x R c 


4 x Tc 


4000 x R c 


T c 


1000 x Rc 


2 


0.444 


164.80 


0.192 


62.90 


0.086 


49.40 


3 


0.700 


101.20 


0.264 


43.10 


0.151 


34.06 


4 


1.112 


83.00 


0.398 


38.82 


0.223 


23.08 


5 


2.248 


60.36 


0.606 


29.20 


0.349 


19.89 


6 


3.560 


45.80 


1.016 


24.26 


0.688 


17.28 


7 


4.028 


46.64 


1.862 


20.30 


2.322 


15.73 


8 


7.748 


38.20 


2.886 


19.20 


3.734 


14.49 


9 


12.348 


37.64 


6.838 


17.20 


7.446 


16.16 


10 


67.640 


29.32 


13.862 


20.04 


11.364 


16.83 



Time measured in microseconds 



To perform Strassen matrix-matrix multiplication algorithm, our implementation 
needs essentially 22 additions and 7 multiplications of matrices having size ^. Since 
the cost to compute one addition is given by niVXor b /4, the whole cost for Strassen's 
algorithm is 

S(n) = -^Xor b + 75 (-) . 
Then, it is convenient to use Strassen's algorithm as long as S(n) < M(n) or 

S(n) = ^-Xor b + 7S(n/2) < ^-Xor b + 7M(n/2) < M(n) . 
In other words, it is convienient to use Strassen's algorithm when 



n>44c+6(2 c -l). 
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Table II. Cost of the product of 2 square matrices having size respectively 32, 64, 128. The normalized 
costs are also reported. 

C A B 

C A B ^S^S 







32 bit 




64 bit 


128 bit 


bit 


32 x 32 


normalized (x64) 


64 x 64 


normalized (x8) 


128 x 128 


2 


0.362 


22.88 


1.060 


8.48 


6.43 


3 


0.255 


16.16 


0.755 


6.04 


4.51 


4 


0.240 


15.20 * 


0.725 


5.80 


2.88 


5 


0.262 


16.64 


0.625 


5.00 * 


2.90 


6 


0.312 


20.00 


0.645 


5.12 


2.89 * 


7 


0.345 


22.08 


0.790 


6.32 


3.91 


8 


0.570 


36.64 


1.025 


8.16 


5.52 


9 


0.845 


54.40 


1.955 


15.60 


9.57 


10 


4.355 


279.80 


3.635 


29.08 


13.56 



6. PERFORMANCE TESTS AND COMPARISON 

To evaluate the performance of our algorithm, we compute the decomposition of 
n x n matrices with entries in F 2 . Our block decomposition works well both for 
dense matrices and for relatively sparse matrices. Notice that in the former case 
the obtained matrices have rank most probably equal to n — 1 (or n); in the latter 
case we have low-rank matrices. 

First, we have constructed a sample of random dense matrices of size n (where 
the size n ranges from 256 and 65536). 

In Table III we give the minimum value of ten observed running times for comput- 
ing our block decomposition (recursive and non-recursive), in case b = 32, 64, 128. 



Table III. Compare running times with b = 32, 64, 128 recursive (rec) and non recursive versions of our 
algorithm. Execution obtained using LLVM 3.0 compiler (MAC OSX) on a 3.06GHz Intel(R) Xeon. 





32 


32 rec 


64 


64 rec 


128 


128 rec 


n 


milliseconds 


256 


0.106 


0.108 


0.131 


0.133 


0.247 


0.250 


512 


0.347 


0.350 


0.327 


0.330 


0.553 


0.557 


1024 


1.574 


1.741 


1.070 


1.171 


1.443 


1.527 


2048 


9.260 


10.862 


5.076 


5.927 


5.172 


5.795 


4096 


64.849 


76.458 


32.454 


37.996 


27.274 


30.792 


n 


seconds 


8192 


0.492 


0.549 


0.248 


0.267 


0.203 


0.206 


16384 


3.922 


3.981 


1.939 


1.981 


1.646 


1.523 


32768 


32.011 


29.791 


23.859 


15.440 


12.884 


11.498 


65536 


253.169 


216.323 


190.065 


114.794 


102.071 


86.156 



In Table IV, the minimum running time of ten trials to obtain a matrix decompo- 
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sition is given. In particular, we compare the running times to obtain M4RI, PLUQ, 
PLE decompositions adopted into SAGE [Stein et al. 2012] with the ones to get our 
128 block recursive decomposition. 



Table IV. Compare running times with three algorithms (PLUQ, M4RI, PLE) adopted into SAGE and our 
recursive algorithm with b = 128 using LLVM 3.0 compiler (MAC OSX) on a 3.06GHz Intel (R) Xeon. 





M4RI 


PLUQ 


PLE 


our 


n 


milliseconds 


256 


0.248 


0.239 


0.239 


0.252 


512 


0.847 


0.836 


0.821 


0.562 


1024 


3.428 


2.653 


2.551 


1.539 


2 048 


12.810 


10.509 


10.003 


5.850 


4 096 


62.231 


53.686 


50.400 


31.406 


n 


seconds 


8192 


0.349 


0.340 


0.334 


0.207 


16 384 


2.462 


2.425 


2.392 


1.532 


32 768 


18.674 


18.558 


18.341 


11.529 


65 536 


135.906 


128.567 


125.783 


86.538 



Then, we constructed low-rank matrices as follows. We consider the samples 

S n ~ {S n .i i — 2, ... , n\ 

of 39 relatively sparse matrices of size n having respectively i non-zero elements 
per rows. 

In Figure 2 we have plotted the observed running times (in milliseconds) for 
M4RI (cross), PLUQ (square), PLE(circles) and our 128 bit block recursive (trian- 
gles) decomposition. This is done for every matrix in the sample S n The size n 
ranges from 1024 to 65536. 
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